Across atoms to crossing continents: Application of similarity measures to biological location data

Biological processes involve movements across all measurable scales. Similarity measures can be applied to compare and analyze these movements but differ in how differences in movement are aggregated across space and time. The present study reviews frequently-used similarity measures, such as the Hausdorff distance, Fréchet distance, Dynamic Time Warping, and Longest Common Subsequence, jointly with several measures less used in biological applications (Wasserstein distance, weak Fréchet distance, and Kullback-Leibler divergence), and provides computational tools for each of them that may be used in computational biology. We illustrate the use of the selected similarity measures in diagnosing differences within two extremely contrasting sets of biological data, which, remarkably, may both be relevant for magnetic field perception by migratory birds. Specifically, we assess and discuss cryptochrome protein conformational dynamics and extreme migratory trajectories of songbirds between Alaska and Africa. We highlight how similarity measures contrast regarding computational complexity and discuss those which can be useful in noise elimination or, conversely, are sensitive to spatiotemporal scales.

through-the-Earth distances for the bird trajectories, which differ from the great-circle distances at continental scales but do not affect relative distances arising from the comparison of point sets.
Quantifying similarity usually also involves detecting outliers, which are fixes that do not follow the movement pattern of their predecessors and successors. One should distinguish between temporal outliers, which could well follow the underlying movement pattern in terms of position but differ in time, and spatial outliers, where a fix is placed further away in space from other fixes which are temporally close.
The sections below discuss the different similarity measures for comparing two trajectories, that assume to have n and m timesteps respectively. Note that for certain measure, it might be necessary that n = m, as discussed below.
Time-dependent analyses The following similarity measures all feature a temporal component, i.e., the timestamps are taken into consideration. For DFD and DWFD, the actual timestamps are not relevant; but the data points are given as an ordered sequence in space, whereas DTW and LCSS allow for a temporal threshold.
Longest Common Subsequence The most straightforward similarity measure is the Longest Common Subsequence Measure (LCSS).
Given a temporal threshold δ and a spatial threshold ε, one compares two trajectories by matching fixes that have at most δ difference in time and at most ε distance in space. Note that each fix of one trajectory can be matched to at most one fix of the other trajectory. The LCSS distance counts the number of consecutive fixes that do match and divides this sum by the total number of fixes. Thus, LCSS returns a value between 1 (all fixes could be matched, the curves are identical within given threshold values) and 0 (no matching fixes could be found). In contrast, larger values for other similarity measures indicate decreasing similarity between data sets. Therefore, to better compare LCSS to the other measures, it is more illustrative to analyze 1 − LCSS instead of LCSS.
Depending on the application, it can make sense to disregard δ altogether: For the prepared bird trajectories and protein dynamics analysis, the exact timing of fixes is less important than their spatial differences; thus, it is natural to set δ = ∞.
As Figure 3 indicates, it can be challenging to determine a spatial threshold value ε providing useful insights. Unfortunately, there is no general guideline to determine the spatial threshold for arbitrary data sets, which is a major drawback of this similarity measure in the absence of predetermined scales of interest.
Like the LCSS method itself, the choice of the spatial threshold ε is highly dependent on the directedness of movement and the relevant spatial scale of biological interest in the study system. For systems characterized by undirected movement, it can become challenging to choose an appropriate ε, which would render the LCSS measure a less meaningful choice, as illustrated for the protein case study (see Fig. 3).
Formally, the following definitions apply [1]: . . , a n } and B = {b 1 , . . . b m } in R d , one defines their longest common subsequence as the longest sequence of strictly increasing indices i 1 , i 2 , . . . i k of A, such that a i = b ij holds for all j = 1, . . . , k.
LCSS is not a distance measure as such, it quantifies similarity as a proportion and consequently does not qualify as a metric, as only the symmetry condition is met. However, it is particularly useful for applications where one asks to group (parts of) data sets, as it can be easily modified to return the common subsequence of a pair of trajectories. Its sensitivity to temporal and spatial outliers depends on choices of δ and ε, which makes it highly adaptive to different applications, but also less intuitive to use.
From an algorithmic viewpoint, one first is required to compute a matrix with entries d ij , where d ij = 1 if the spatial distance of a i and b j is less than ε while they also differ by at most δ steps in time. Otherwise, d ij is set to 0. Starting in one corner of the matrix, one should then search for a path to the opposite corner of the matrix that features the maximum number of diagonal steps. Assuming one starts at d 11 , said path consists of a series of entries where each entry either lies diagonally to the lower right, directly to the right, or directly below its predecessor, such that the second entry on the path could be d 22 , d 12 or d 21 , and so on, until the element d nm is reached. The result is usually normalized, i.e., is divided by the smallest of the compared sequence lengths. Computation time with LCSS is dependant on the size of the matrix involved, as a constant number of computational steps per matrix entry is needed. Thus O(nm) time is required to compute the LCSS of two sequences of length n and m.
Fréchet/Weak-Fréchet distance Before introducing DFD and DWFD, it is useful to first describe the classic (continuous) Fréchet distance [2], a so-called bottleneck distance between parameterized curves in general. Fréchet distance has most often been applied to compare, characterize and cluster real-world trajectory data [3,4].
The following analogy is often used to describe the Fréchet distance to gain intuition. Consider a person walking their dog. Each of them follows a predefined path, and they need to start and finish their walk at the same time. The Fréchet distance equals the maximum distance between the human and the dog during their walk, provided they choose their respective speeds to minimize the maximum distance, i.e., it equals the length of the shortest possible dog leash.
More formally, one considers two sets of time stamped points, e.g., geolocation data, where p i represents a position in one, two or three dimensions, and t i are the time stamps. The points are ordered in an ascending order of timestamps. As timestamps are no longer important as soon as the data sequences are ordered, the corresponding labels can be omitted. To obtain parameterized curves, interpolation between consecutive points (fixes), should be accomplished by introducing straight line segments p i , p i+1 to obtain a connected curve for both R and S.
In the course of the Fréchet distance calculation, both curves have to be traversed in order, i.e., starting from the first fixes of the two sequences, one travels along the interpolated edges to the respective subsequent point, and so on, until both traversals reach their last fixes. The Fréchet distance equals the largest distance between a pair of fixes traversed in the process, provided the fixes are chosen to try to minimize their maximum separation.
Finally, the weak Fréchet distance, as opposed to the Fréchet distance, allows a change in direction while traversing. Thus, at times a fix might be considered which would have been impossible in the Fréchet distance calculation.
Discrete Fréchet Distance Biological data almost inevitably consists of discrete sets of locations marked with corresponding time stamps. The following variant of the Fréchet distance [5] is therefore particularly useful.

Definition 2
The discrete Fréchet distance, also known as the coupling distance, of two data sets A = {a 1 , a 2 , . . . a n } and where L is a coupling, which is defined as a sequence of point pairs . For these pairs, it holds that To gain intuition on the discrete Fréchet distance, imagine pebbles on a pond. Instead of continious paths, one is now tasked with crossing the pond by hopping from one pebble to the next one. The timestamps give a traversal order of the pebbles. Instead of a person and a dog, picture two frogs, each hopping on a set of pebbles. Both frogs need to travel in one direction (i.e., cross the pond) but can progress either simultaneously or one at a time. The sequence in which they progress is encoded by the coupling L. The question is to determine the coupling that minimizes the maximum distance between the two frogs.
The DFD is a metric, which means that it is a well defined distance measure. Let m and n be the number of timesteps in the two considered trajectories. Algorithmically speaking, one is required to compute (a part of) the distance matrix with a i and b j being the ith element of the first trajectory and the jth element of the second trajectory. Selecting d 11 in the upper left corner of the matrix, one should then determine the best path through D to connect d 11 to d nm in the bottom right corner. Starting at d 11 , this path is formed by selecting either d 12 , d 21 or d 22 , i.e., permitting navigation to the right, downwards or diagonally to the bottom right, until d nm is reached. Thus, the path encodes the coupling L. Theobjective is to find the path whose maximum entry value is as small as possible. The asymptotic runtime of the algorithm is determined by the cost of computing the distance matrix. Thus O(nm) time is needed to compute the DFD of two trajectories featuring n and m timesteps, respectively.
Discrete Weak Fréchet Distance Analogously to the Fréchet distance, DWFD allows backtracking. In the example above, the frogs are now not only allowed to hop to the next pebble but could also visit the previous pebble again. The difference in the formal definition is subtle, though:

Definition 3
The discrete weak Fréchet distance of two data sets A = {a 1 , . . . a n } and B = {b 1 , . . . , b m }, A, B ⊂ R d is defined as where L is now a semi-coupling, which is a sequence of point pairs For the pairs it now holds that This seemingly minor difference has major implications for the similarity measure. First, DWFD is not a metric, as two sequences A, B may be distinct but still have δ DW F D (A, B) = 0. On the other hand, outliers with respect to the time component hardly influence the similarity results since backtracking is allowed.
DWFD is computed by means of finding an optimal path in the distance matrix D, where it is possible to traverse upwards and to the left, i.e., considering an entry d ij on the path, the subsequent entries may be d i±1,j , d i,j±1 or d i+1,j+1 . Even though the asymptotic runtime of O(nm) is not affected by the change in algorithm, computation time increases significantly compared to DFD.
Dynamic Time Warping DTW is a technique originally designed for speech recognition [6], as it aligns similar patterns without disregarding the order in which they occur.
The technique is closely related to DFD; for two sequences A, B of timestamped fixes, i.e., locations in R 2 or R 3 , one seeks to find a coupling of points. However, unlike April 13, 2023 4/9 DFD, DTW is an aggregated rather than bottleneck measure. For DTW, a coupling that minimizes the sum of squared distances of all distance-minimizing pairs of locations is selected. More formally [7]: Definition 4 For sequences A = {a 1 , . . . , a n } and B = {b 1 , . . . b m } in R d , the dynamic time warping distance is defined as Unlike DFD, DTW is not a metric as it does not obey the triangle inequality. Since the sum of distances between matched points is compared, rather than the maximum distance of all matched pairs of points, single outliers become less significant for DTW, as compared to the DFD calculation. Compared to DWFD, DTW picks up on all types of outliers, not just the spatial ones. DTW can be computed by means of the distance matrix D.
In that case, one should determine the path corresponding to a coupling L * = min L (ai,bi)∈L d 2 (a i , b i ). Note that one can additionally introduce a parameter (called δ) that would only allow an "offset" of δ positions in the coupling to be considered, i.e., only allowing u i , v i with d(u i , v i ) ≤ δ. We set δ = ∞ for the illustrated case studies in the main manuscript and set it as a default value in the SiMBols package.
Time-independent analyses Some data sets either do not feature a temporal component at all, or their timing is not considered necessary for the specific application. The following sections therefore focus on spatial similarity only.
Specifically, the Hausdorff distance (HD), a well-known distance and similarity measure often used in shape matching [8], the Difference Distance Matrix approach (DDM), which is used in protein analysis [9,10], as well as Wasserstein distance (WD) and Kullback-Leibler divergence (KLD) are discussed. The latter two are similarity measures for probability distributions and are particularly useful when analyzing large data sets.
Hausdorff distance The Hausdorff distance is a classic distance and similarity measure used to compare and analyze geometric shapes [11]. It is related to the Fréchet distance but does not take traversal orders into account. The Hausdorff distance can be illustrated as follows: Consider two groups of people standing at fixed positions, the first group wears red clothes and the second one blue clothes. All red-clothed people identify the distance to the closest person wearing blue and vice versa. The maximum of all distances to the closest person wearing the respective other color then equals the Hausdorff distance between the two groups.
Definition 5 For two point sets P = {p 1 , . . . , p n } and Q = {q 1 , . . . q m } in R d , the Hausdorff distance is defined as The Hausdorff distance is a metric, and it is insensitive to temporal outliers as it does not feature a temporal component. Spatial outliers, however, are as easily picked up as with DFD; both are bottleneck distance measures.
April 13, 2023 5/9 Computation-wise, finding the closest point in set P for each point in Q and vice versa takes O(nm) time. Determining the largest distance among the computed ones can be done without further increasing the asymptotic runtime.
Difference Distance Matrix approach The following approach returns a vector of distances, where each entry encodes the distance, or rather the similarity, of one point in a point set to all points of a second point set. Consequently, the difference distance matrix approach is commonly used in protein analysis, where a residue in a protein is compared to all residues of a reference structure or the second protein.
For the difference distance matrix approach (DDM), one first is required to compute a matrix holding the normalized pairwise distances between two points of the same set.
Having computed the distance matrices for the two datasets, one computes the differences between their entries and build the mean value of all entries in each column. The result is a vector holding said column sums. . . , q m k }}, where p i (resp. q j ) denote the position of point p i (resp. q j ) at time . The difference distance matrix D with entries d ij for one point set is then defined as The corresponding distances between two point sets are then given by Note again that d DDM comes as an n-dimensional vector, where the ith entry encodes the distance of point p i ∈ P to the point set Q (or vice versa for interchanged roles of i and j, or P and Q). This similarity measure fulfills all requirements of a metric. The DDM method is used in protein analysis because it is remarkably robust to noise due to several normalization steps during its computation [9,10]. Single outliers are still detectable since one obtains the difference between a single point of one set to all points of the other set. Thus, it is possible to compare the distance values for all individual points of a set.
As with most similarity measures before, computation time is dominated by computing the distance matrices. There are n + m + 1 matrices of size k 2 to compute, one per time step, so the overall asymptotic runtime sums up to O((n + m)k 2 ).
Wasserstein distance The Wasserstein distance (WD) is a distance measure between normalized distributions. Thus the point sets should be treated as distributions in U, V ⊂ R d . Again, the order of points is disregarded, therefore movement trajectories are less suited for this approach. For proteins and other biochemical structures, however, the Wasserstein distance is a useful mean for analyzing similarity [12]. It is often referred to as the Earth mover's distance due to the following common analogy: Assume two piles of earth describe the given distributions. The Wasserstein distance equals the amount of work a person has to apply to shift the first pile to match the shape and position of the second pile.
April 13, 2023 6/9 Definition 7 The Wasserstein distance of two normalized distributions U = {u 1 , . . . , u n } and V = {v 1 , . . . , v n } in R is defined as where W D 0 = 0 and W D i+1 = u i + W D i − v i , i = 0, . . . , n − 1 keep track of how much "earth" needs to be moved between consecutive points in the distribution.
The Wasserstein distance is a metric for normalized distributions. Two trajectories can only be analyzed using WD if they have the same number of timesteps, n = m.
Consider two sets of points in R 3 , namely the x, y and z coordinates for all elements in the trajectories. These sets are analyzed separately: First, the x-coordinates of the two point sets are considered as distributions and their corresponding WD is computed. This procedure is repeated for the remaining y-(and z-) coordinates and all separate WD-values are added to obtain the total distance between the point sets. Temporal outliers are only relevant if the timestamps are considered as additional coordinates.
Single spatial outliers are hard to detect since the pointwise distances are summed up, while noise may increase the distance value. The WD similarity measure is particularly fast to compute, as a constant number of operations per point is performed. In total, one needs O(n) time to compute the Wasserstein distance of two sets of n points each. Note that the general definition of WD for points in R d for d > 1 is slightly more complicated, and its computation takes O(n 3 ) time by applying the so-called Hungarian algorithm [13].
Kullback-Leibler divergence The Kullback-Leibler divergence, also called relative entropy, is an asymmetric distance between a probability distribution and its reference distribution [14]. Informally, it quantifies the information loss that arises from using V as an approximation for the actual distribution U .
Definition 8 For two distributions U, V , as defined for the Wasserstein distance, the Kullback-Leibler divergence is defined as As a symmetric measure is desirable to compare two sets of equally obtained data, we employ a smoothed, symmetric version: Intuitively, KLD quantifies how well one can distinguish between the two given distributions. The Kullback-Leibler divergence itself is not a distance metric due to its asymmetric nature. The definition of KLD in Eq. (9) resolves the asymmetry. However, the triangle inequality does not hold, so KLD still does not qualify as a distance metric. Note that all values in the considered data sets must be strictly positive, which is not necessarily the case for protein data, for example. Therefore, before applying KLD, all points in the compared data sets are translated such that the minimum coordinate value for any point equals one. This measure is highly aggregated, which makes it tolerant to noise.
Computing the KLD requires computing the natural logarithm for all entries of the two data sets. As both data sets need to have the same size, one requires to compute O(n) logarithms, which can then be weighted and added up in constant time. Thus O(n) time to compute δ KLD (U, V ) is needed, assuming that the distributions U and V feature n strictly positive entries, each.